Document semantic analysis/selection with knowledge creativity capability

ABSTRACT

A computer based software system and method for semantically processing a user entered natural language request to identify and store linguistic subject-action-object (SAO) structures, using such structures as key words/phrases to search local and web-based databases for downloading candidate natural language documents, semantically processing candidate document texts into candidate document SAO structures, and selecting and storing only relevant documents whose SAO structures include a match with a stored request SAO structure. Further features include analyzing relationships among relevant document SAO structures and creating new SAO structures based on such relationships that may yield new knowledge concepts and ideas for display to the user and generating and displaying natural language summaries based on the relevant document SAO structures.

REFERENCE TO PRIORITY APPLICATION

[0001] This application claims the benefit of U.S. Provisional Application No. 60/099,641, filed Sep. 9, 1998.

BACKGROUND

[0002] The present invention relates to computer based natural language processing systems and more particularly to computer based systems and methods of processing natural language text to identify Subject, Action, Object triplets and relationships between such triplets, storing this data and processing this data to semantically analyze, select, summarize, store, and display candidate documents containing specific content or subject matter.

[0003] Computer based document search processors are known to perform key word searches for publications on the Internet and World Wide Web. Today, information owners and service providers are adapting their databases to individual tastes and requirements. For example, Boston based Agents, Inc. offers over the Web personalized newsletters for music fans such that classical music lovers are blocked from receiving Rap music advertisements and vice-versa. KD, Inc. of Hong Kong has developed a system that takes into consideration words similar by sense while searching the Web. Today, the user can download 10,000 papers from the Web by typing the word “Screen”. The search system designed by KD, Inc. asks the user whether he/she is seeking papers related to Computer Screen, TV Screen or Window Screen. In this case, the number of unrelated papers will be drastically reduced.

[0004] Software based search processors are able to remember requests of a single user and to conduct personalized non-stop searches on the Web. So, when a user wakes up in the morning, he/she finds references and abstracts of several new Web papers related to his/her area of interest. In 1997, practically all fundamental technical publications, journals, magazines, as well as patents of all industrial countries became available on the Web, i.e., available in electronic format.

[0005] Although key word searching the Web affords the user great value, it also has created and will continue to create substantial problems adversely affecting this value. Specifically, because of the enormous amount of information available on the Web, key word search processors produce too much downloaded information, the vast majority of which is irrelevant or immaterial to the information the user wants. Many users simply give up in frustration when presented with several hundred articles in response to what the user considered a request for only those few articles related to a specific request.

[0006] This problem is also experienced in the technical fields of science and engineering, particularly since there is a growing number of libraries, government patent offices, universities, government research centers, and others adding vast amounts of technical and scientific information for Web access. Engineers, scientists, and doctors are overwhelmed with too many articles, papers, patents and general information on the topic of interest to them. In addition, the user presently has only two choices when examining a downloaded article to determine its relevance to the user's project. He/she can either read the author's abstract and/or scan various sections of the full article to determine whether or not to save or print out that specific document. Since the author's abstract is not comprehensive, it often omits the reference to the specific subject matter of interest to the user or treats this subject matter in an incomprehensive manner. Thus, scanning the abstract and scanning the full article may have little value and require an inordinate amount of user time.

[0007] Various attempts purport to increase the recall and precision of the selection, such as U.S. Pat. Nos. 5,774,833 and 5,794,050, incorporated herein by reference; however, these methods simply rely on key word or phrase searching with various techniques of selection based on variations of the key words, or purported understanding of textual phrases. These prior methods may improve recall but tend to require too much physical and mental effort and time to determine why the document was selected and what is the pertinent part. This results from the entire document or abstract being presented without summary or concept generation.

SUMMARY OF EXEMPLARY EMBODIMENT OF PRESENT INVENTION

[0008] A computer based software system and method according to the principles of the present invention solves the foregoing problems and has the ability to perform a non-stop search of all databases on the Web or other network for key words and to semantically process candidate documents for specific knowledge concepts, such as technological functions or specific physical effects, so that only the very few prioritized or a single document meeting the search criteria is presented or identified to the user.

[0009] Further, the computer based software system in accordance with the principles of the present invention captures these highly relevant documents and creates a compressed, short summary of the precise technical physical aspects designated by the search criteria.

[0010] Another aspect of the present invention includes using the semantic analysis results of the selected documents to create new ideas of knowledge concepts. The system does this by analyzing the subject, action, and object triplets mentioned in the documents, identifying cause and effect triplet relationships, and re-organizing these triplet representations into new and/or different profiles of such elements. As further described below, some of these reorganized sets of relationships among these elements may comprise new concepts never before thought of by anyone.

[0011] According to an aspect of the present invention, the method and apparatus begins with the user entering natural language text related to the task, concept, or subject matter for which the user desires to acquire publications or documents. The system analyzes this request text and automatically tags each word with a code that indicates the type of word it is. Once all words in the request are tagged, the system performs a semantic analysis that, in one example, includes determining and storing the verb groups within the first sentence of the request, then determining and storing the noun groups within that sentence of the request. This process is repeated for all sentences in the request.

[0012] Next, the system parses each request sentence with a hierarchical algorithm into a coded framework (tree) which is substantially indicative of the sense of the sentence. The system includes databases of various types to aid in generating the coded framework, such as grammar rules, parsing rules, dictionary synonyms, and the like. Once the parsed sentence codes are stored, the system identifies Subject-Action-Object (SAO) extractions within each sentence and stores them. A sentence can have one, two, or a plurality of SAO extractions as seen in the detailed description below. Each extraction is normalized into a SAO structure by processing extractions according to certain rules described below. Accordingly, the result of the semantic analysis routine performed on the request text is a series of SAO structures (triplets) indicative of the content of the request. These request SAO structures are applied to (1) a comparative module for comparing the SAO structures of candidate documents as described below and (2) a search request and key word generator that identifies key words and key combinations of words, and synonyms thereof, for searching the Web, internet, intranet, and/or local databases for candidate documents. Any suitable search engine, e.g. Alta Vista™, can be used to identify, select, and download candidate documents based on the generated key words.
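By way of illustration only, the tagging and extraction steps just described can be sketched in Python under strong simplifying assumptions. The lexicon, the tag codes (DET, NOUN, VERB, UNK), the SAO class, and the functions tag and extract_sao are all hypothetical names introduced for this sketch; the sketch handles only a single simple clause and stands in for, rather than reproduces, the rule databases and hierarchical parser of the described system.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical toy lexicon mapping words to word-type codes.
# The actual tag codes of the described system (FIG. 8) are not reproduced here.
LEXICON = {
    "the": "DET", "a": "DET",
    "water": "NOUN", "pump": "NOUN", "engine": "NOUN",
    "cools": "VERB", "drives": "VERB",
}


@dataclass(frozen=True)
class SAO:
    """One Subject-Action-Object extraction; the subject may be absent."""
    subject: Optional[str]
    action: str
    obj: str


def tag(sentence: str) -> List[Tuple[str, str]]:
    """Tag every word with a code indicating its type (UNK if not in the lexicon)."""
    words = sentence.lower().rstrip(".").split()
    return [(word, LEXICON.get(word, "UNK")) for word in words]


def extract_sao(tagged: List[Tuple[str, str]]) -> Optional[SAO]:
    """Naive extraction: nouns before the first verb form the subject group,
    nouns after it form the object group."""
    verb_index = next((i for i, (_, code) in enumerate(tagged) if code == "VERB"), None)
    if verb_index is None:
        return None
    subject = " ".join(w for w, code in tagged[:verb_index] if code == "NOUN")
    obj = " ".join(w for w, code in tagged[verb_index + 1:] if code == "NOUN")
    if not obj:
        return None
    return SAO(subject or None, tagged[verb_index][0], obj)
```

For example, extract_sao(tag("The water pump cools the engine.")) yields an extraction with subject "water pump", action "cools", and object "engine". The extracted words are still the originals of the sentence; normalization is a separate step.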

[0013] It should be understood that, as mentioned above, key word searching produces an over-abundance of candidate documents. However, according to the principles of the present invention, the system performs substantially the same semantic analysis on each candidate document as performed on the user input search request. That is, the system generates an SAO structure(s) for each sentence of each candidate document and forwards them to the comparative Unit where the request SAO structures are compared to the candidate document SAO structures. Those few candidate documents having SAO structures that substantially match the request SAO structure profile are placed into a retrieved document Unit where they are ranked in order of relevance. The system then summarizes the essence of each retrieved document by synthesizing those SAO structures of the document that match the request SAO structures and stores this summary for user display or printout. Users can later read the summary and decide to display or print out or delete the entire retrieved document and its SAO's.
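The comparison and ranking just described might be sketched as follows. This is a minimal illustration, not the comparative Unit itself: exact string equality stands in for a "substantial" match, a request structure with no subject is assumed to match on action and object alone, and ranking by the number of matched request structures is an assumption of this sketch. The names SAO, matches, and rank_documents are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class SAO:
    """A normalized Subject-Action-Object structure; the subject may be absent."""
    subject: Optional[str]
    action: str
    obj: str


def matches(request: SAO, candidate: SAO) -> bool:
    """Assumed matching rule: action and object must agree; the subject is
    compared only when the request structure has one."""
    if request.action != candidate.action or request.obj != candidate.obj:
        return False
    return request.subject is None or request.subject == candidate.subject


def rank_documents(
    request_saos: List[SAO], documents: Dict[str, List[SAO]]
) -> List[Tuple[str, int]]:
    """Keep only documents with at least one match and order them by the
    number of request structures they satisfy (an assumed relevance measure)."""
    ranked = []
    for doc_id, doc_saos in documents.items():
        hits = sum(1 for r in request_saos if any(matches(r, c) for c in doc_saos))
        if hits:  # documents with no match are dropped
            ranked.append((doc_id, hits))
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked
```

A document whose structures match none of the request structures never enters the ranked list, which corresponds to the deletion of unmatched candidate documents.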

[0014] As stated above, the SAO structures for each sentence for each retrieved document are stored in the system according to the present invention. According to the knowledge creativity aspect of the present invention, the system analyzes all these stored structures, identifies where common or equivalent subjects and objects exist, and reorganizes, generates, or synthesizes new SAO structures or new strings (relationships) of SAO structures for the user's consideration. Some of these new structures or strings may be unique and comprise new solutions to problems related to the user's requested subject matter. For example, if two structures S1-A1-O1 and S2-A2-O2 are stored, and the present system recognizes that S2 is equivalent to or the synonym for or has some other stored relation to O1, then it will generate and store for the user's access a summary of S1-A1-S2-A2-O2. Or, if the system stores an association between S1 and A2, it can generate S1-A1/A2-O1 to suggest improvement of O1 toward desired results.
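The two reorganizing rules of the preceding paragraph can be illustrated with a short Python sketch. The synonym and association tables, their contents, and the names SAO, equivalent, chain, and alternate_action are hypothetical and chosen only for this example; how the described system actually stores and looks up such relations is not specified here.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class SAO:
    subject: str
    action: str
    obj: str


# Hypothetical stored relations, for illustration only.
SYNONYMS = {frozenset({"heat", "thermal energy"})}
ASSOCIATIONS = {("laser", "VAPORIZE")}  # (subject, action) pairs


def equivalent(a: str, b: str) -> bool:
    """Two terms are equivalent if identical or listed as synonyms."""
    return a == b or frozenset({a, b}) in SYNONYMS


def chain(first: SAO, second: SAO) -> Optional[Tuple[str, str, str, str, str]]:
    """If S2 is equivalent to O1, build the string S1-A1-S2-A2-O2."""
    if equivalent(first.obj, second.subject):
        return (first.subject, first.action, second.subject, second.action, second.obj)
    return None


def alternate_action(first: SAO, second: SAO) -> Optional[SAO]:
    """If an association between S1 and A2 is stored, build S1-A1/A2-O1."""
    if (first.subject, second.action) in ASSOCIATIONS:
        return SAO(first.subject, f"{first.action}/{second.action}", first.obj)
    return None
```

With the structures laser-GENERATE-heat and thermal energy-MELT-solder, chain returns the string laser, GENERATE, thermal energy, MELT, solder, because "heat" and "thermal energy" are stored as synonyms in this toy table.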

[0015] Other and further advantages and benefits shall become apparent with the following detailed description when taken in view of the appended drawings, in which:

DRAWING DESCRIPTION

[0016] FIG. 1 is a pictorial representation of one exemplary embodiment of the system according to the principles of the present invention.

[0017] FIG. 2 is a schematic representation of the main architectural elements of the system according to the present invention.

[0018] FIG. 3 is a schematic representation of the method according to the principles of the present invention.

[0019] FIG. 4 is a schematic representation of Unit 16 of FIG. 2.

[0020] FIG. 5 is a schematic representation of Unit 20 of FIG. 2.

[0021] FIG. 6 is a schematic representation of Unit 22 of FIG. 2.

[0022] FIG. 7 is a typical example of the user request text entered by the user.

[0023] FIG. 8 is a tagged and coded version of the text of FIG. 7.

[0024] FIG. 9 is an identification of verb groups of the text of FIG. 8.

[0025] FIG. 10 is an identification of noun groups of the coded text of FIG. 8.

[0026] FIG. 11 is a representation of the parsed, hierarchically coded text of FIG. 8.

[0027] FIG. 12 is a representation of SAO extraction of the text of FIG. 7.

[0028] FIG. 13 is a representation of SAO structures of the extraction of FIG. 12.

DETAILED DESCRIPTION OF EXEMPLARY EMBODIMENTS

[0029] One exemplary embodiment of a semantic processing system according to the principles of the present invention includes:

[0030] A CPU 12 that could comprise a general purpose personal computer or networked server or minicomputer with standard user input and output devices such as keyboard 14, mouse 16, scanner 19, CD reader 17, and printer 18. System 10 also includes standard communication ports 21 to LANs, WANs, and/or public or private switched networks to the Web.

[0031] With reference to FIGS. 1-6, the semantic processing system 10 includes a temporary storage or database 12 for receiving and storing documents downloaded from the Web or local area network or generated as a user request text with use of keyboard 14 or one of the other input devices. The user can type the request, examples of which are disclosed below, or enter full documents into DB 12 and designate the document as the user's request. System 10 further includes semantic processor 14 for receiving the entire text of each document, which includes a Subject-Action-Object (SAO) analyzer Unit 16 that tags each word of each sentence with a code type (such as Markov chain theory code). Unit 16 then identifies each verb group and noun group (described below) within each sentence and parses and normalizes each sentence into SAO structures that represent the sense of the sentence. Unit 16 applies its output to DB of SAO structures 18. SAO processor Unit 20 stores the request SAO structures and receives the SAO structures of each sentence of each document stored in Unit 18. Unit 20 compares the document SAO's to the request SAO's and deletes those documents with no matches. The SAO structures of matched documents are stored back in Unit 18 or some other storage facility. In addition, Unit 20 analyzes SAO structures within a single document or with those of one or more other relevant documents, searches for relationships among S-A-O's and generates new SAO structures for user consideration. These new structures are stored in Unit 18 or some other storage facility in the system.

[0032] Unit 14 further includes natural language Unit 22 that receives SAO structures in table form and synthesizes the structures into natural language form, i.e. sentences.

[0033] Unit 14 also includes keyword Unit 24, which receives SAO structures, extracts key words and phrases from them, and acquires their synonyms for use as additional key words/phrases.
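As a rough sketch of what such a keyword step could look like, the following Python function collects the elements of one SAO structure and appends any stored synonyms as additional key words/phrases. The synonym table and the name key_phrases are hypothetical; the actual synonym databases and the query format expected by any particular search engine are not assumed here.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical synonym dictionary, for illustration only.
SYNONYMS: Dict[str, List[str]] = {
    "isolate": ["separate", "insulate"],
    "slides": ["slide plates"],
}


def key_phrases(sao: Tuple[Optional[str], str, str]) -> List[str]:
    """Return the key words/phrases of one (subject, action, object) structure,
    each followed by its stored synonyms, without duplicates."""
    phrases: List[str] = []
    for element in sao:
        if element is None:  # a structure may have no subject
            continue
        term = element.lower()
        for phrase in [term, *SYNONYMS.get(term, [])]:
            if phrase not in phrases:
                phrases.append(phrase)
    return phrases
```

The resulting list could then be handed to whatever search interface is in use; how the phrases are combined into a query is left open.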

[0034] Database Units 26, 28, and 30 receive the outputs from Unit 14, generally as shown, for storing the natural language summaries of selected SAO structures as described below and the key words/phrases that form the user request sent to search engines through port 21.

[0035] Unit 16 includes document pre-formatter 32 that receives the full text of documents from Unit 12 and converts the text and other contents to a standard plain text format. Text coder 34 analyzes each word of each sentence of text and tags every word with a code that designates the word type (see FIG. 8). Various databases designated 44 in FIG. 4 are available to aid the sub-units of Unit 16. Following tagging, recognizer Unit 36 identifies the verb groups (FIG. 9) and the noun groups (FIG. 10) of each sentence. Sentence parser 38 then parses each sentence into a hierarchical coded form that represents the sense of the sentence (FIG. 11). S-A-O extractor 40 organizes the SAO's of each sentence into extracted table format (FIG. 12). Then normalizer 42 normalizes the extractions into SAO structures as described above (FIG. 13).

[0036] SAO processor 20 includes three main Units. Comparative Unit 46 receives SAO structures from database 18. One set of these structures originates from the user request text described above and other sets originate from the candidate documents. Unit 46 then compares these two sets looking for matches between SAO structures of these two sets. If no match results then the candidate document and associated SAO's are deleted. If a match is identified then the document is marked relevant and ranked and stored in Unit 12 and its SAO structures stored in Unit 18. Unit 46 then compares all candidate documents in sequence and in the same way as described.

[0037] Unit 20 also includes the SAO structure reorganizing Unit 48, which synthesizes new SAO structures from the SAO structures of different documents on the same matter, combining them into new structures as described above, and applies them to Unit 18.

[0038] Filtering Unit 50 analyzes every SAO structure of each document and blocks or deletes those not relevant to the SAO structures of the request.

[0039] Reference 52 designates some of the databases available to aid sub-units of Unit 20.

[0040] SAO synthesizer Unit 22 (FIG. 6) includes a Subject detector 54 for detecting the content of the subject for each received SAO structure. If S is detected then the SAO is fed to Unit 56 in which the tree structure of the verb group(s) is restored to natural language using grammar, semantic, speech patterns, and synonyms rules database 66. Synthesizer 58 does the same for subject noun groups and synthesizer 60 does the same for object noun groups. Combiner 68 then organizes and combines these groups into a natural language sentence.

[0041] If S was not detected by Unit 54, the SAO structures are processed by synthesizer 62 to restore the verb group in passive form. Synthesizer 64 processes the object noun group for a passive sentence, and combiner 70 organizes and combines the groups into a natural language sentence.
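The branch on subject detection described in the two preceding paragraphs can be sketched in Python as follows. The verb-form table and the function synthesize are hypothetical stand-ins for the grammar and speech pattern rules of database 66; the sketch ignores number agreement and always uses "is" in the passive form.

```python
from typing import Dict, Optional, Tuple

# Hypothetical table: infinitive -> (third-person present, past participle).
VERB_FORMS: Dict[str, Tuple[str, str]] = {
    "COOL": ("cools", "cooled"),
    "ISOLATE": ("isolates", "isolated"),
}


def synthesize(subject: Optional[str], action: str, obj: str) -> str:
    """Build an active sentence when a subject is detected and a passive
    sentence otherwise (singular agreement assumed)."""
    present, participle = VERB_FORMS[action]
    if subject:
        sentence = f"{subject} {present} {obj}."  # active form
    else:
        sentence = f"{obj} is {participle}."  # passive form
    return sentence[0].upper() + sentence[1:]
```

Here synthesize("the pump", "COOL", "the engine") gives "The pump cools the engine.", while synthesize(None, "COOL", "the engine") gives "The engine is cooled."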

[0042] If SAO structures received by Unit 54 bear new structure markings, then combiners 68 and 70 apply their output to Unit 28, and if they are marked as existing SAO structures, then combiners 68 and 70 apply their output to Unit 26. See FIG. 3.

[0043] The salient steps of the method according to the principles of the present invention are shown in FIG. 3, where the numbers in parentheses refer to the Units of FIG. 2 where the process steps take place. A session begins with the user inputting a natural language request, which could be a customized request typed with the keyboard or a natural language document entered via one of the input devices shown in FIG. 1. A typical user-generated customized request is shown in FIG. 7. Unit 14 of System 10 then processes the request by first tagging each word with a type code (see FIG. 8), then identifying the verb groups (FIG. 9) and noun groups (FIG. 10) of each sentence, then parsing each sentence into a hierarchical tree (FIG. 11), and then extracting the SAO extractions, in which all extracted words are the originals of the request (FIG. 12).

[0044] Then the method normalizes (modifies) these words; for example, each action is changed to its infinitive form. Thus, “is isolated” in FIG. 12 is changed to “ISOLATE”, the word “to” being understood (FIG. 13). It should be understood that not all attributes of the subject, action and objects appearing in FIG. 11 are shown in FIGS. 12 and 13, but the system knows the full attributes associated with the SAO elements and these attributes are part of the SAO structure. Also, note in FIG. 13 that no subject is listed for the last action because none is indicated pursuant to the planning rules. This absence does not affect the reliability of the overall method because all sentences of the candidate documents that include an A-O of Isolate-slides will be considered a match regardless of the subject. The normalized SAO's are referred to herein as SAO structures. These user request SAO structures are stored and applied in the two following steps: (i) synthesis of key words/phrases of the user request; (ii) a comparative analysis of the SAO structures of each sentence of each candidate document as described below.
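The action normalization described above, in which “is isolated” becomes “ISOLATE”, might be sketched as follows. The auxiliary list, the small inflection table, and the function normalize_action are hypothetical; a real system would rely on its dictionary and grammar rule databases rather than a hand-written lookup.

```python
# Hypothetical auxiliary verbs to strip from a verb group.
AUXILIARIES = {"is", "are", "was", "were", "be", "been", "being", "has", "have", "had"}

# Hypothetical lookup from inflected form to infinitive, for illustration only.
INFINITIVES = {"isolated": "isolate", "cools": "cool", "heated": "heat"}


def normalize_action(verb_group: str) -> str:
    """Reduce a verb group to the upper-case infinitive of its head verb.
    Forms missing from the lookup table are upper-cased unchanged."""
    words = [w for w in verb_group.lower().split() if w not in AUXILIARIES]
    head = words[-1] if words else verb_group.lower()
    return INFINITIVES.get(head, head).upper()
```

For instance, normalize_action("is isolated") returns "ISOLATE", so an active and a passive use of the same verb reduce to one comparable action.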

[0045] The request SAO structure key words/phrases are stored and sent to a standard search engine to search for candidate documents in local databases, LANs and/or the Web. Alta Vista™, Yahoo™, or other typical search engines could be used. The engine, using the request SAO structure key words/phrases, identifies candidate documents and stores them (full text) for system 10 analysis. Next the SAO analysis as described above for the search request is repeated for each sentence of each candidate document so that SAO structures are generated and stored as indicated in FIG. 3. In addition, the SAO structures of each document are used in the comparative steps where the request SAO structures are compared with the candidate document SAO structures. If no match is found then the documents and related SAO structures are deleted from the system. If one or more matches are found then the document and related structures are marked relevant and its relevancy marked for example on a scale of 1.0 to 100. The full relevant document text is permanently stored (although it can later be deleted by user if desired) for display or print-out as user desires. Relevant SAO structures are also marked relevant and permanently stored.

[0046] Next, System 10 filters out the least relevant SAO structures and uses the matched SAO structures of each relevant document to synthesize a natural language summary consisting of summary sentence(s) for the matched SAO structures and the page number where the complete sentence associated with each matched SAO structure appears. This summary is stored and available for user's display or print-out as desired.

[0047] Filtered relevant SAO structures of relevant document(s) are analyzed to identify relationships among the subjects, actions, and objects among all relevant structures. Then SAO structures are processed to reorganize them into new SAO structures for storage and synthesis into new natural language sentence(s). The new sentences may, and some of them probably will, express or summarize new ideas, concepts and thoughts for users to consider. The new sentences are stored for user display or print-out.

[0048] For example, if

[0049] S₁-A₁-O₁

[0050] S₂-A₂-O₂

[0051] S₃-A₃-O₃

[0052] and S₁ is the same as or a synonym of O₃, then S₃-A₃-S₁-A₁-O₁ is synthesized into a new sentence and stored.

[0053] Accordingly, the method and apparatus according to the present invention provides the user automatically with a set of new ideas directly relating to the user's requested area of interest, some of which ideas are probably new and suggest possible new solutions to the user's problems under consideration, and/or the specific documents and summaries of pertinent parts of specific documents related directly to the user's request.

[0054] Although mention has been made herein of application of the present system and method to the engineering, scientific and medical fields, the application thereof is not limited thereto. The present invention has utility for historians, philosophers, theology, poetry, the arts or any field where written language is used.

[0055] It will be understood that various enhancements and changes can be made to the example embodiments herein disclosed without departing from the spirit and scope of the present invention.

We claim:
1. A natural language document analysis and selection system comprising, a general purpose computer having a monitor, a central processing unit (CPU), a user input device for generating request data representing a natural language request, and a communications device for communication with local and remote natural language document databases, said CPU comprising (i) first storage means for storing the request data, (ii) a semantic processor for generating request subject-action-object (SAO) extractions in response to receiving request data, and (iii) SAO storage means for storing representations of the request SAO extractions.
2. A system as set forth in claim 1, wherein said communication device conveys candidate document data to said CPU for storage in said first storage means, the candidate document data representing natural language document text, said semantic processor generating candidate document SAO extractions in response to receiving candidate document data, and said SAO storage means also storing representations of candidate document SAO extractions.
3. A system as set forth in claim 2, wherein said semantic processor identifies matches between said representations of said request SAO extractions and said candidate document SAO extractions.
4. A system as set forth in claim 3, wherein said semantic processor comprises means for marking as relevant candidate document data that includes at least one representation of candidate document SAO extraction that matches at least one representation of request SAO extraction.
5. A system as set forth in claim 4, wherein said semantic processor comprises means for deleting stored candidate document data and stored representations of candidate document SAO extractions for those documents that have no representation of candidate document SAO extraction that matches a representation of request SAO extraction.
6. A system as set forth in claim 3, wherein said semantic processor includes an SAO text analyzer having a plurality of stored text formatting rules, coding rules, word tagging rules, SAO recognizing rules, parsing rules, SAO extraction rules, and normalizing rules for applying such rules to the request data and candidate document data such that said representations of candidate document SAO extractions and of request SAO extractions comprise candidate document and request SAO structures, respectively.
7. A system as set forth in claim 6 further comprising second storage means for storing request SAO structures and for applying SAO structures as key words/phrases to said communication device for application to document search engines on the WEB or local databases to cause downloading of candidate document data to the system.
8. A system as set forth in claim 6 further comprising an SAO synthesizer for generating and storing for display on said monitor natural language summaries of marked documents in response to receipt of document SAO structures.
9. A system as set forth in claim 6 further comprising an SAO synthesizer for analyzing relationships among subjects, actions, and objects among relevant and stored SAO structures and processing those SAO structures that have a relationship with at least one other SAO structure to generate a different SAO structure and storing the different SAO structure for display to the user.
10. A system as set forth in claim 9 wherein said relationship comprises: S₁-A₁-O₁, S₂-A₂-O₂, where S₁ synonym O₂, then S₂-A₂-S₁-A₁-O₁.
11. In a digital data processing system including the World Wide Web and a general purpose computer having a monitor, a central processing unit (CPU), a user input device, and a communications device for communication with local and remote natural language document databases, the method of analyzing and selecting natural language documents comprising, generating request data representing a natural language request, storing the request data, semantically processing the request data to generate request subject-action-object (SAO) extractions, and storing representations of the request SAO extractions.
12. The method as set forth in claim 11, wherein said communication device conveys candidate document data to said CPU, the candidate document data representing natural language document text, storing the candidate document data, said semantically processing including generating candidate document SAO extractions in relation to the candidate document data, and storing representations of candidate document SAO extractions.
13. A method as set forth in claim 12, wherein said semantically processing includes identifying matches between said representations of said request SAO extractions and said candidate document SAO extractions.
14. A method as set forth in claim 13, wherein said semantically processing comprises marking as relevant candidate document data that includes at least one representation of candidate document SAO extraction that matches at least one representation of request SAO extraction.
15. A method as set forth in claim 14, wherein said semantically processing comprises deleting access to stored candidate document data and stored representations of candidate document SAO extractions for those documents that have no representation of candidate document SAO extraction that matches a representation of request SAO extraction.
16. A method as set forth in claim 13, wherein said semantically processing includes applying a plurality of stored text formatting rules, noun and verb recognition rules, coding rules, word tagging rules, SAO recognizing rules, parsing rules, SAO extraction rules, and normalizing rules to the request data and candidate document data such that said representations of candidate document SAO extractions and representations of request SAO extractions comprise candidate document and request SAO structures, respectively.
17. A method as set forth in claim 16 further comprising storing request SAO structures and applying SAO structures as key words/phrases to document search engines on the WEB or local databases to cause downloading of candidate document data to the CPU.
18. A method as set forth in claim 16 further comprising generating and storing and displaying on said monitor natural language summaries of marked relevant documents in relation to relevant document SAO structures.
19. A method as set forth in claim 16 further comprising analyzing relationships among subjects, actions, and objects among relevant and stored SAO structures, further processing those SAO structures that have a relationship with at least one other relevant and stored SAO structure, and generating a different SAO structure based on the said relationship, and storing the different SAO structure and displaying the different SAO structure to the user.
20. A method as set forth in claim 19 wherein said relationship comprises: S₁-A₁-O₁ comprises one relevant and stored SAO structure, S₂-A₂-O₂ comprises a second relevant and stored SAO structure, where said relationship comprises S₁ synonym O₂ and the different SAO structure is S₂-A₂-S₁-A₁-O₁.
21. A method as set forth in claim 19 wherein said relationship comprises: S₁-A₁-O₁ comprises one relevant and stored SAO structure, S₂-A₂-O₂ comprises a second relevant and stored SAO structure, where said relationship exists between S₁ and A₂ and the different SAO structure is S₁-A₁/A₂-O₁ where / means alternate.